Enabling CSE encryption for COPY command #377

Open: wants to merge 4 commits into base: master

@ameent ameent commented Nov 28, 2017

With these changes, one can write data to S3 using client-side encryption (CSE) with a custom symmetric master key and then use spark-redshift to ingest it. I am ingesting sensitive data into Redshift and can't use S3 server-side encryption (SSE) for it.

The main change was to switch away from the WITH CREDENTIALS syntax and pass the iam_role, access_key_id, etc. parameters explicitly, so that the end user can use "extracopyoptions" to supply their symmetric key.

Usage:

sparkSession.table(name)
      .write
      .format("com.databricks.spark.redshift")
      .option("url", redshiftUrl)
      .option("dbtable", options(REDSHIFT_TABLE_NAME))
      // If the table does not exist, then it will be created. If it exists, then data will be appended to it
      .mode(SaveMode.Append)
      .option("temporary_aws_access_key_id", options(AWS_ACCESS_ID))
      .option("temporary_aws_secret_access_key", options(AWS_SECRET_KEY))
      .option("temporary_aws_session_token", options(AWS_SESSION_TOKEN))
      .option("tempdir", options(REDSHIFT_TEMPORARY_S3_LOCATION))
      // we must encrypt the data when writing to S3, and we have to pass
      // the symmetric encryption key to Redshift so that it can decrypt the data.
      // See http://docs.aws.amazon.com/redshift/latest/dg/c_loading-encrypted-files.html
      .option("extracopyoptions", s"encrypted master_symmetric_key '$encodedSymmetricKey'")
      .save()
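The usage above does not show where `encodedSymmetricKey` comes from. As a minimal sketch, assuming the key is a 256-bit AES key passed to Redshift in base64 form (the object and method names here are hypothetical and not part of this PR), it could be produced like this. The same key must be the one used to encrypt the files written to the temporary S3 location.

```scala
import java.security.SecureRandom
import java.util.Base64
import javax.crypto.KeyGenerator

// Hypothetical helper, not part of spark-redshift.
object SymmetricKeyExample {
  // Generates a random 256-bit AES key and returns it base64-encoded.
  def newEncodedSymmetricKey(): String = {
    val generator = KeyGenerator.getInstance("AES")
    generator.init(256, new SecureRandom())
    Base64.getEncoder.encodeToString(generator.generateKey().getEncoded)
  }

  // Builds the value passed to the "extracopyoptions" option in the usage above.
  def extraCopyOptions(encodedSymmetricKey: String): String =
    s"encrypted master_symmetric_key '$encodedSymmetricKey'"
}
```

In practice the key would more likely be loaded from a secrets store than generated on each run, since Redshift needs the key that actually encrypted the data.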

Ameen Tayyebi added 4 commits November 28, 2017 10:10
Changed the commands generated for Redshift to use specific authorization
keywords instead of the CREDENTIALS clause. In other words, the commands
produced will no longer be of the form:
CREDENTIALS 'aws_access_key_id=X;aws_secret_access_key=Y'
and will instead be of the form:
ACCESS_KEY_ID 'X' SECRET_ACCESS_KEY 'Y'

This is needed because, to load payloads encrypted with client-side encryption
into Redshift, one needs to pass master_symmetric_key as an argument to the COPY
command. However, it is also an option within the CREDENTIALS clause, so if a
query to Redshift includes both a CREDENTIALS clause and master_symmetric_key,
Redshift will report this error:
com.amazon.support.exceptions.ErrorException: Amazon Invalid operation: conflicting or redundant options;
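The switch described in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the types and function names are hypothetical, and the statement is simplified (the real COPY generated by spark-redshift also carries format options). The keywords `IAM_ROLE`, `ACCESS_KEY_ID`, `SECRET_ACCESS_KEY`, and `SESSION_TOKEN` are Redshift's authorization parameters.

```scala
// Hypothetical sketch, not the spark-redshift implementation.
object CopyAuthorizationSketch {
  sealed trait AwsCredentials
  final case class IamRole(arn: String) extends AwsCredentials
  final case class AccessKeys(
      accessKeyId: String,
      secretAccessKey: String,
      sessionToken: Option[String] = None) extends AwsCredentials

  // Renders credentials as explicit keywords instead of a CREDENTIALS string.
  def authorizationClause(credentials: AwsCredentials): String = credentials match {
    case IamRole(arn) => s"IAM_ROLE '$arn'"
    case AccessKeys(id, secret, token) =>
      val base = s"ACCESS_KEY_ID '$id' SECRET_ACCESS_KEY '$secret'"
      token.fold(base)(t => s"$base SESSION_TOKEN '$t'")
  }

  // Simplified COPY statement; extraCopyOptions is appended verbatim, so it
  // can carry "encrypted master_symmetric_key '...'" without conflicting
  // with a CREDENTIALS clause.
  def copyStatement(
      table: String,
      manifestUrl: String,
      credentials: AwsCredentials,
      extraCopyOptions: String): String =
    s"COPY $table FROM '$manifestUrl' ${authorizationClause(credentials)} MANIFEST $extraCopyOptions".trim
}
```

Because the key is no longer part of a CREDENTIALS string, it appears exactly once in the statement, which is what avoids the "conflicting or redundant options" error.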

When data is encrypted in S3 and a COPY command is invoked, it's expected
that the manifest is not encrypted and is in plain-text form.

Encryption on EMR through EMRFS is controlled by a Hadoop option (fs.enable.cse).
Once set, all data that goes through the file system will be encrypted.

This commit adds an exception around generation of manifest files so that even
if the encryption option is set, the manifest file created on S3 is not encrypted.

This enables Redshift to read the manifest and ingest the data even for cases where
data is encrypted on the client side with a symmetric encryption key.
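For reference, the manifest that Redshift reads is a small JSON document listing the data files. A minimal sketch of producing that plain-text content is shown below; the object and function names are hypothetical, and the sketch does not show how this commit bypasses encryption when writing the file.

```scala
// Hypothetical sketch of the manifest content, not the code in this PR.
object ManifestSketch {
  // Escapes backslashes and double quotes so a URL is a valid JSON string.
  private def jsonEscape(s: String): String =
    s.replace("\\", "\\\\").replace("\"", "\\\"")

  // Builds a Redshift COPY manifest listing each data file as mandatory.
  def manifestJson(fileUrls: Seq[String]): String = {
    val entries = fileUrls
      .map(url => s"""{"url": "${jsonEscape(url)}", "mandatory": true}""")
      .mkString(", ")
    s"""{"entries": [$entries]}"""
  }
}
```

The data files listed in the manifest stay encrypted; only this index file has to be readable by Redshift without the key.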

Switched the UNLOAD statement away from the WITH CREDENTIALS
method; it now passes the role, access key, secret key,
session token, etc. explicitly.

Generally speaking, this is a more flexible way of passing credentials,
though for UNLOAD it doesn't make much difference. The change is
made for consistency with the COPY command, where it is
necessary to enable copying client-side encrypted
data into Redshift.

http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-authorization.html
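A rough sketch of an UNLOAD statement built with explicit authorization keywords is shown below. The names are hypothetical and the statement is simplified; the `authorization` argument is assumed to be a ready-made clause such as `IAM_ROLE '...'` or `ACCESS_KEY_ID '...' SECRET_ACCESS_KEY '...'`.

```scala
// Hypothetical sketch, not the spark-redshift implementation.
object UnloadSketch {
  // Escapes backslashes and single quotes so the query can be embedded
  // inside the quoted argument of UNLOAD ('...').
  def escapeQuery(query: String): String =
    query.replace("\\", "\\\\").replace("'", "\\'")

  // Places the authorization clause where WITH CREDENTIALS '...' used to go.
  def unloadStatement(query: String, s3Prefix: String, authorization: String): String =
    s"UNLOAD ('${escapeQuery(query)}') TO '$s3Prefix' $authorization ESCAPE MANIFEST"
}
```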